A Combined Theory Data-Driven Approach to Classifying Delinquency Risk in the Future of Families and Child Well-Being Study
Name: Nicholas Vietto
PhD Candidate at the University of Nebraska - Omaha
Research Interests: Biopsychosocial Criminology, Quantitative Methods, Data Visualization, Open-Science, Open-Source Software
Goal:
Classify delinquency risk at age 15 using data from ages 9 and 15 in the Future of Families and Child Wellbeing Study (FFCWS).
How:
Building on the findings and method of Chan et al. (2023), we implement a feed-forward neural network using the {tidymodels} framework in R.
In Chan et al. (2023), a model combining factors from the social, psychological, and biological domains outperformed models using any single domain, predicting a conduct disorder (CD) diagnosis with 91.18% accuracy.
Using the Future of Families and Child Wellbeing Study (FFCWS), this study extends that work in three ways:
Expanded Sociological Domain: Incorporates rich socio-environmental predictors, including census tract variables, labor market measures, and proximity to gun-violence incidents.
Incorporating Genetic Data: Includes genes involved in the serotonergic and dopaminergic pathways to examine the role of polymorphic variation.
Classifying Delinquency Risk: Targets delinquency risk rather than a CD diagnosis.
Socio-Environmental Domain
Parental Monitoring Scale (Focal Child, Year 15)
Neighborhood Collective Efficacy Scale (Focal Child, Year 15)
Conflict Tactics Scale (Focal Child, Year 15)
Material Hardship Scale (PCG, Year 15)
Psychological Domain
BSI 18 Anxiety Scale (Focal Child, Year 15)
Center for Epidemiologic Studies Depression Scale (CES-D) (Focal Child, Year 15)
Dickman’s Impulsivity Scale (Focal Child, Year 15)
Genetic Domain
Advanced Data Processing: Efficiently handles and analyzes large amounts of data to enhance predictive power.
Uncovering Complex Relationships: Identifies non-linear and higher-order interactions, especially in high-dimensional datasets (e.g., image or audio data), providing deeper insight into relationships among variables.
Enhanced Predictive Accuracy: Continuously refines predictions through iterative learning, improving overall accuracy over time.
Further Reading: Mapping of machine learning approaches for description, prediction, and causal inference in the social and health sciences
LearnOpenCV
60/20/20 split for training, validation, and testing.
2128 observations after merging data
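The 60/20/20 split above can be illustrated with a minimal sketch. The analysis itself uses {tidymodels} in R; the Python below is only an illustration, and the function name `split_indices`, the seed, and the simple unstratified shuffle are all assumptions rather than the study's actual procedure.

```python
import random


def split_indices(n_rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle row indices and cut them into train/validation/test lists.

    Hypothetical helper: the fractions match the slide, but the seed and
    the lack of stratification are assumptions made for illustration.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 2128 is the merged sample size reported on the slide.
train_idx, val_idx, test_idx = split_indices(2128)
print(len(train_idx), len(val_idx), len(test_idx))
```

With 2128 rows this yields 1276 training, 425 validation, and 427 test observations, because the fractional remainders are assigned to the test set.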
Genetic Data Constraints: Genetic information is confined to markers from the candidate gene era, potentially limiting genomic coverage.
Sample Size: The sample is small relative to typical machine learning applications, which may limit the robustness and generalizability of the results.
Age of Assessment: Age 15 may be early for assessing delinquency risk, as behaviors predictive of long-term patterns may not yet be fully evident.
Enhance Domain Optimization: Add features to maximize the model’s performance in each specific domain (e.g., adding labor market measures as distal predictors in the sociological domain).
Evaluate Fairness Across Ethnicities: Assess the final model’s performance across different ethnic groups to ensure fairness, verifying it does not exhibit biases against social or minority groups.
Test model on Year 22 data: Validate the model’s performance on the Year 22 data to assess its generalizability and predictive power.
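One simple way to begin the fairness check described above is to compare accuracy within each group. The sketch below is a hedged illustration only: the helper `accuracy_by_group`, the group labels "A" and "B", and the toy labels and predictions are all made up, and accuracy is just one of several metrics such an audit could compare.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group label to its classification accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy data, invented for illustration (not FFCWS values).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

result = accuracy_by_group(y_true, y_pred, groups)
gap = max(result.values()) - min(result.values())
print(result, gap)
```

In this toy example group A is classified correctly 75% of the time and group B 50% of the time, a gap of 0.25. A large gap on held-out data would flag the model for closer inspection.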
Questions?
Data Modeling Culture
Primary Goal: Drawing causal inferences
Approach: Emphasizes deductive reasoning
Process: Models the data-generating process to clarify relationships between X and Y
Culture: Grounded in methodologies developed primarily by statisticians
Algorithm Modeling Culture
Primary Goal: Maximizing predictive accuracy
Approach: Emphasizes inductive reasoning, with a focus on learning patterns directly from data
Process: Utilizes black-box models to capture relationships between X and Y
Culture: Rooted in methodologies developed primarily by computer scientists